Are English rivers becoming more polluted?¶

Last Sunday, I came across what feels like the latest in a long line of articles condemning the state of rivers in the UK. I grew up swimming in the murky rivers of the south: the muddy stretch of the Thames known in Oxford as the Isis, then Hampshire's Itchen and Cardiff's Taff and Rhymney. It's hard to escape the feeling that river swimming (and wild swimming more generally) is becoming more popular, maybe even chic. Yet at the same time (and perhaps because of this uptick in river dipping), it also feels like our rivers are in the midst of an ecological crisis, driven by pollution and land-use change.

I wanted to investigate this idea further. Are rivers getting more polluted? Are there regional or even local differences in the state and trends of our rivers? Luckily for me, the UK government has been quietly measuring hundreds of water quality indicators at sites across the country for over 20 years. Something like 58 million measurements, made on nearly 4 million samples taken from over 58 thousand locations, are available for the years 2000 to 2016. This data is freely available through the Water Quality Data Archive, either as .csv files or through a REST-style API.
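The yearly bulk downloads used later in this workbook follow a predictable URL pattern, so fetching a given year's archive is just string formatting. A minimal sketch (whether this S3 URL stays stable long-term is an assumption):

```python
## base URL for the archive's yearly bulk CSVs (the same one used in the
## download cell later in this workbook)
BASE = ('https://s3-eu-west-1.amazonaws.com/'
        'swirrl-defra-wims-live-cache-prod/wims/persistent/')

def bulk_csv_url(year: int) -> str:
    """Return the bulk-download CSV URL for a given year."""
    return f'{BASE}{year}.csv'

## every fourth year between 2000 and 2016
urls = [bulk_csv_url(y) for y in range(2000, 2017, 4)]
print(urls[0])  # ends with .../2000.csv
```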

This workbook¶

Obviously media outlets wouldn't monger doom in the interest of more clicks and reads... but just in case, I decided to draw my own data-driven conclusions, using the Water Quality Data Archive to better understand what's really happening to our waterways.

In this workbook, I access the entire dataset and then group the measurements by region and area. I also only look at every fourth year between 2000 and 2016, to minimise how much of this train's WiFi I'm hogging...

In addition to the water quality measurements, I also accessed shapefiles for UK river catchments and some useful regional boundaries from the Ordnance Survey Open Rivers dataset. That way, I can characterise the data by region and drainage basin.

In [1]:
### import dependencies 
import numpy as np
import pandas as pd
import polars as pl
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_color_codes("dark")

import warnings
warnings.simplefilter("ignore") ## this is bad, don't do this
In [3]:
### load river catchment shapefile
catch = gpd.read_file('/Users/starr/My Drive/Files/Code Bin/datasci_portfolio/river_polution/England_shapefile/WFD_River_Water_Body_Catchments_Cycle_3.shp')

### plot catchments and area classification
f,ax = plt.subplots(1,1,figsize=(6,10))
catch.dissolve('OPCAT_NAME').plot(column='MNCAT_NAME',cmap='Blues',lw=0.15,ax=ax,label='Areas')
catch.dissolve('RBD_NAME').boundary.plot(color='xkcd:dark grey',ax=ax,lw=0.5,label='Regions')
ax.axis('off')
ax.set_title('Regions and Areas\n used to classify river data in this workbook')
sns.set_theme(style="whitegrid")
In [4]:
catch_t = catch[catch['RBD_NAME']=='Anglian']

f,ax = plt.subplots(1,1,figsize=(6,10))
catch_t.dissolve('OPCAT_NAME').plot(column='MNCAT_NAME',cmap='Blues',lw=0.15,ax=ax,label='Areas')
catch_t.dissolve('RBD_NAME').boundary.plot(color='xkcd:dark grey',ax=ax,lw=0.5,label='Regions')
ax.axis('off')
ax.set_title('Subareas of the Anglian catchment region\n used to classify river data in this workbook')
sns.set_theme(style="whitegrid")

Load the river pollution data¶

Download data for every fourth year between 2000 and 2016 for Sanitary and Nutrient measurements. Store the downloaded data as Pickle objects and convert them to georeferenced GeoPandas dataframes.

In the below cells, I also combine the measurement data with the river basin and catchment labels, so that each measurement has a label for area, subarea and catchment.
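The labelling works via a spatial join: each sampling point picks up the name of the catchment polygon it falls inside. A toy version of the join performed in make_gdf() below, with made-up catchment names and coordinates:

```python
import geopandas as gpd
from shapely.geometry import Polygon

## two made-up rectangular "catchments" on the British National Grid
catchments = gpd.GeoDataFrame(
    {'OPCAT_NAME': ['West catchment', 'East catchment']},
    geometry=[
        Polygon([(0, 0), (5, 0), (5, 10), (0, 10)]),
        Polygon([(5, 0), (10, 0), (10, 10), (5, 10)]),
    ],
    crs='EPSG:27700',
)

## two made-up sampling sites
sites = gpd.GeoDataFrame(
    {'site': ['A', 'B']},
    geometry=gpd.points_from_xy([2, 7], [3, 8]),
    crs='EPSG:27700',
)

## left join: each site keeps its row and gains the catchment it sits in
labelled = gpd.sjoin(sites, catchments, how='left')
print(labelled[['site', 'OPCAT_NAME']])
```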

In [5]:
## which years do we want data for (every 4 years to limit how massive the dataset is)
years = np.arange(2000,2017,4)
In [17]:
labels = ['metals','nutrients','pesticides','pharmaceuticals','sanitary'] ## labels classifying the different measurements

groups = pl.read_csv('https://environment.data.gov.uk/water-quality/def/determinand-groups.csv?_sort=label') ## load the groupings of Determinands in a Polars dataframe
grps = {}
for label in labels:
    grps[label] = groups.filter(pl.col('label')==label)['member.label'][0].split('|')


## download a year of measurements and extract a given determinand group
def get_data(year,group='sanitary',material="RIVER / RUNNING SURFACE WATER"):
    url = 'https://s3-eu-west-1.amazonaws.com/swirrl-defra-wims-live-cache-prod/wims/persistent/'+year+'.csv' # build the URL for the given year
    data = pd.read_csv(url)
    data = data[data['sample.sampledMaterialType.label']==material] # filter just measurements on River water (exclude groundwater etc)
    res = data[data['determinand.label'].isin(grps[group])] # extract just data for the given measurement group (i.e. sanitary)
    res = res.groupby(['sample.samplingPoint.label','determinand.label']).mean(numeric_only=True).drop(columns=['codedResultInterpretation.interpretation','sample.isComplianceSample','determinand.notation'],errors='ignore') # average by sampling point and drop superfluous columns
    return res


def make_gdf(df):
    '''
    Take the Pandas dataframe retrieved by get_data() and convert it to a georeferenced GeoPandas object.
    Also combine with river catchment and basin shapefiles to provide geographical labels (to make life easier when analysing the data) 
    '''
    
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['sample.samplingPoint.easting'], df['sample.samplingPoint.northing']))
    gdf = gdf.set_crs('EPSG:27700')
    
    ## England river basins and catchments
    catch = gpd.read_file('/Users/starr/My Drive/Files/Code Bin/datasci_portfolio/river_polution/England_shapefile/WFD_River_Water_Body_Catchments_Cycle_3.shp')
    basin = gpd.read_file('/Users/starr/My Drive/Files/Code Bin/datasci_portfolio/river_polution/England_shapefile/WFD_River_Basin_Districts_Cycle_3.shp')
    catch_ = catch.dissolve(by='OPCAT_NAME').reset_index()[['OPCAT_NAME','geometry']].to_crs('EPSG:27700')    
    gdf = gpd.sjoin(gdf, catch_, how="left")
    gdf = gdf.drop(columns='index_right').sjoin(basin)
    
    return gdf.drop(columns=['index_right','RBD_ID','url'])

sanitary = {}
g_sanitary = {}
for year in years:
    sanitary[year] = get_data(str(year),'sanitary')
    g_sanitary[year] = make_gdf(sanitary[year])

nutrients = {}
g_nutrients = {}
for year in years:
    nutrients[year] = get_data(str(year),'nutrients')
    g_nutrients[year] = make_gdf(nutrients[year])
In [18]:
#### store the downloaded data for later use with Pickle 
import pickle
f = open('sanitary.pckl', 'wb')
pickle.dump([g_sanitary], f)
f.close()
f = open('nutrients.pckl', 'wb')
pickle.dump([g_nutrients], f)
f.close()
In [6]:
### if retrieving pickles - run this cell
import pickle

## load pickled sanitary data
f = open('sanitary.pckl', 'rb')
pickled_sanitary = pickle.load(f)
f.close()

## load pickled nutrients data
f = open('nutrients.pckl', 'rb')
pickled_nutrients = pickle.load(f)
f.close()

# "unjar the pickles"
g_sanitary = {}
for year in years: 
    g_sanitary[year] = pickled_sanitary[0][year]
    g_sanitary[year] = g_sanitary[year].set_crs('EPSG:27700',allow_override=True) # set map projection as the British National Grid
    
g_nutrients = {}
for year in years: 
    g_nutrients[year] = pickled_nutrients[0][year]
    g_nutrients[year] = g_nutrients[year].set_crs('EPSG:27700',allow_override=True) # set map projection as the British National Grid
In [7]:
## do some additional engineering of the catchment and basin shapefiles
catch_ = catch.dissolve(by='OPCAT_NAME').reset_index()[['OPCAT_NAME','geometry']].to_crs('EPSG:27700')    
basin = gpd.read_file('/Users/starr/My Drive/Files/Code Bin/datasci_portfolio/river_polution/England_shapefile/WFD_River_Basin_Districts_Cycle_3.shp')

Find statistics and create summary dataframes for the pollution measurements¶

  • retrieve the average measurement of each determinand by area and subarea
  • this greatly reduces the size of the dataframe and allows us to start plotting the pollution results by geographical region
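The aggregation in get_summaries() below boils down to a groupby/agg with mean and std, followed by flattening the resulting MultiIndex columns. A toy version with made-up numbers:

```python
import pandas as pd

## four made-up BOD results across two regions
df = pd.DataFrame({
    'REGION': ['Anglian', 'Anglian', 'Thames', 'Thames'],
    'determinand.label': ['BOD ATU'] * 4,
    'result': [2.0, 4.0, 1.0, 3.0],
})

## mean and std per region/determinand; agg produces two-level columns
summary = df.groupby(['REGION', 'determinand.label']).agg(['mean', 'std']).reset_index()

## flatten the MultiIndex columns by renaming, as get_summaries() does
summary.columns = ['REGION', 'determinand.label', 'result_mean', 'result_std']
print(summary)
```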
In [8]:
def get_summaries(gdf):
    gdf = gdf.groupby(['RBD_NAME','OPCAT_NAME','determinand.label']).agg(['mean','std']).reset_index()
    gdf.columns = gdf.columns.droplevel()
    gdf.columns = ['REGION','BASIN','determinand.label','result_mean','result_std','sample.samplingPoint.easting_mean','sample.samplingPoint.easting_std','sample.samplingPoint.northing_mean','sample.samplingPoint.northing_std']
    return gdf

## summaries for Sanitary measurements
sanitary_stats = {}
for year in years:
    sanitary_stats[year] = get_summaries(g_sanitary[year])

## summaries for Nutrient measurements
nutrients_stats = {}
for year in years:
    nutrients_stats[year] = get_summaries(g_nutrients[year])

First exploratory look at the data¶

  • using HoloViews with Bokeh to make a quick-and-dirty interactive plot
In [9]:
import holoviews as hv
hv.extension('bokeh')
hv.renderer('bokeh').theme = 'dark_minimal'

def filtered_df(year,determinand, **kwargs):
    data = sanitary_stats[year]
    df = data[data['determinand.label']==determinand]
    return hv.Bars(df, hv.Dimension('REGION'), 'result_mean').opts(ylabel='Mean value',xrotation=90,framewise=True,fontscale=1, width=500, height=400,ylim=(0,df['result_mean'].max()*1.1))

determinands = sanitary_stats[2016]['determinand.label'].unique()

dmap = hv.DynamicMap(filtered_df, kdims=['year','determinand'])
dmap = dmap.redim.values(year=years,determinand=determinands)
hv.output(widget_location='top')
dmap.opts(framewise=True)
Out[9]:
In [10]:
hv.extension('bokeh')
hv.renderer('bokeh').theme = 'dark_minimal'

def filtered_df(year,determinand, **kwargs):
    data = nutrients_stats[year]
    df = data[data['determinand.label']==determinand]
    return hv.Bars(df, hv.Dimension('REGION'), 'result_mean').opts(ylabel='Mean value',xrotation=90,framewise=True,fontscale=1, width=500, height=400,ylim=(0,df['result_mean'].max()*1.1))

determinands = nutrients_stats[2000]['determinand.label'].unique()

dmap = hv.DynamicMap(filtered_df, kdims=['year','determinand'])
dmap = dmap.redim.values(year=years,determinand=determinands)
hv.output(widget_location='top')
dmap.opts(framewise=True)
Out[10]:

Analysis 1: Sanitary Measurements¶

Measurements used as indicators of "sanitary" pollution include:

  • Biological Oxygen Demand (BOD)
  • Ammonia
  • Suspended solids (at 105ºC)

I will start with BOD, which gives insight into the amount of oxygen required to break down organic matter in water. If BOD is too high, dissolved oxygen starts decreasing and biological life starts dying! BOD is a good indicator of wastewater discharge and organic pollutants; high BOD might indicate poor wastewater/sewage treatment.
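As a rough rule of thumb, BOD values map onto qualitative quality bands. The 8 mg/L "very polluted" threshold is the one marked on the time-series plots later in this workbook; the lower cut-offs are approximate textbook bands, so treat them as assumptions for illustration:

```python
def bod_band(bod_mg_per_l: float) -> str:
    """Classify a BOD measurement (mg/L) into a rough quality band.
    The 8 mg/L 'very polluted' cut-off matches the threshold used in the
    plots below; the lower bands are approximate."""
    if bod_mg_per_l < 2:
        return 'clean'
    if bod_mg_per_l < 5:
        return 'moderately clean'
    if bod_mg_per_l < 8:
        return 'polluted'
    return 'very polluted'

print(bod_band(1.2))  # clean
print(bod_band(9.5))  # very polluted
```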

In [11]:
import matplotlib.pyplot as plt
import seaborn as sns
f,ax = plt.subplots(5,1,figsize=(8,12),sharex=True,sharey=True)
sns.set_color_codes("dark")

for a,year in zip(ax,years):
    sns.barplot(x='result_mean',y="REGION", data=sanitary_stats[year][sanitary_stats[year]['determinand.label']=='BOD ATU'],ax=a,palette='magma')
    a.set_title(year)
ax[0].set_title('Biological Oxygen Demand\n2000')

for a in ax:
    a.set_xlabel('')

sns.despine(bottom=True, left=True)
sns.set_theme(style="whitegrid")
ax[-1].set_xlabel('BOD (mg/L)')
Out[11]:
Text(0.5, 0, 'BOD (mg/L)')
In [12]:
## extract time series of BOD for each region
regions = sanitary_stats[2000]['REGION'].unique()
ts = {}
for region in regions:
    ts[region] = []
    for year in years:
        ts[region].append(sanitary_stats[year][(sanitary_stats[year]['REGION']==region)&(sanitary_stats[year]['determinand.label']=='BOD ATU')]['result_mean'].mean())


sns.set_theme(style="ticks")
        
f,ax = plt.subplots(10,1,sharex=True,figsize=(6,13))
palette = sns.color_palette("rocket",10)
for i,col,region in zip(range(0,10),palette,regions):
    ax[i].plot(years,ts[region],label=region,marker='o',c=col)
    ax[i].axhline(8,linestyle=':',color='k',alpha=0.6)
    ax[i].text(2017,8,'←"very polluted"',va='center',fontsize=8)
    ax[i].set_title(region,fontweight='normal',fontsize=10)
    ax[i].set_ylabel('BOD (mg/L)',fontsize=9)
    ax[i].set_xticks(years)
plt.subplots_adjust(hspace=0.4)

What's going on in the Anglian region?!¶

Average BOD levels are extremely high in the Anglian region, particularly in 2016. Is this because of a few extreme sample points pulling up the average?
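A quick illustration of why a handful of extreme sites can dominate a regional mean while barely moving the median (made-up numbers):

```python
from statistics import mean, median

## nine typical BOD readings plus one sewage-strength outlier (mg/L)
readings = [2.1, 1.8, 2.5, 3.0, 2.2, 1.9, 2.8, 2.4, 2.0, 600.0]

print(f'mean:   {mean(readings):.2f}')   # dragged way up by the outlier (~62)
print(f'median: {median(readings):.2f}') # barely affected (~2.3)
```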

Instead of a bar plot, try a "strip plot":

In [13]:
f,ax = plt.subplots(1,1,figsize=(12,4))

df = sanitary_stats[2016][sanitary_stats[2016]['determinand.label'] == 'BOD ATU']
sns.stripplot(
    data=df, x="result_mean", y="REGION", hue="REGION",palette='rocket',
    alpha=.75, zorder=1, legend=False
)

plt.xlabel('BOD ATU (mg/L)')
plt.title('BOD for 2016',fontweight='bold')
sns.despine(bottom=True, left=True)
sns.set_theme(style="whitegrid")
plt.axvline(600,color='brown',linestyle=':',lw=1,alpha=0.5)
plt.text(620,'Severn','Raw sewage',rotation=90,color='brown',fontsize=10)
Out[13]:
Text(620, Severn, 'Raw sewage')

Although measurements in the Anglian region are generally higher than in the other regions, there are two points with astronomically high BOD measurements... like pure untreated sewage levels. In fact, untreated sewage in Europe averages around 600 mg/L BOD (Sawyer, McCarty, and Parkin (2003), Chemistry for Environmental Engineering and Science).

Which rivers were these measurements taken in?

In [14]:
N_2000 = sanitary_stats[2000][(sanitary_stats[2000]['determinand.label']=='BOD ATU')&(sanitary_stats[2000]['REGION']=='Anglian')]
N_2012 = sanitary_stats[2012][(sanitary_stats[2012]['determinand.label']=='BOD ATU')&(sanitary_stats[2012]['REGION']=='Anglian')]
N_2016 = sanitary_stats[2016][(sanitary_stats[2016]['determinand.label']=='BOD ATU')&(sanitary_stats[2016]['REGION']=='Anglian')]

f,ax = plt.subplots(1,1,figsize=(5,10))
plt.errorbar(x=N_2000['result_mean'],y=N_2000['BASIN'],marker='o',label='2000',linestyle='-',ms=5,alpha=0.6)
plt.errorbar(x=N_2012['result_mean'],y=N_2012['BASIN'],marker='o',label='2012',linestyle='--',ms=5,alpha=0.6)
plt.errorbar(x=N_2016['result_mean'],y=N_2016['BASIN'],marker='o',label='2016',linestyle=':',ms=5,alpha=0.6)
plt.legend()

# plt.yticks(N_2000['BASIN'], N_2000['BASIN'], rotation='vertical');
plt.grid(alpha=0.3)
plt.title('BOD Measurements by River (Anglian Region)',fontweight='bold')


plt.axvline(600,color='brown',linestyle=':',lw=1,alpha=0.5)
plt.text(625,'Wensum','Raw sewage',rotation=90,color='brown',fontsize=10)
Out[14]:
Text(625, Wensum, 'Raw sewage')

This more granular view shows us that the extremely high BOD measurements were taken from the Old Bedford and Upper Great Ouse Rivers in 2016.

I don't know how rigorous the data quality protocol is for this dataset, nor do I know the particulars of where those measurements were taken (for example, measurements from a stagnant part of the river disconnected from the primary flow might give anomalously high results); however, I think I'll be avoiding the Old Bedford and Great Ouse when planning my next river swimming trip....


Analysis 1b and c: Ammonia and Suspended Solids¶

Now, for completeness, let's see how these measurements look for Suspended Solids and Ammonia:

In [15]:
f,ax = plt.subplots(1,1,figsize=(12,4))

df = sanitary_stats[2016][sanitary_stats[2016]['determinand.label'] == 'Ammonia(N)']
sns.stripplot(
    data=df, x="result_mean", y="REGION", hue="REGION",palette='rocket',
    alpha=.75, zorder=1, legend=False
)

plt.xlabel('Ammonia (mg/L)')
plt.title('Ammonia for 2016',fontweight='bold')
sns.despine(bottom=True, left=True)
sns.set_theme(style="whitegrid")
plt.axvline(3,color='brown',linestyle=':',lw=1,alpha=0.5)
plt.text(3.1,'South East','Recommended effluent limit',rotation=90,color='brown',fontsize=10)
Out[15]:
Text(3.1, South East, 'Recommended effluent limit')
In [16]:
f,ax = plt.subplots(1,1,figsize=(12,4))

df = sanitary_stats[2016][sanitary_stats[2016]['determinand.label'] == 'Sld Sus@105C']
sns.stripplot(
    data=df, x="result_mean", y="REGION", hue="REGION",palette='rocket',
    alpha=.75, zorder=1, legend=False
)

plt.xlabel('Suspended Solids (mg/L)')
plt.title('Suspended Solids for 2016',fontweight='bold')
sns.despine(bottom=True, left=True)
sns.set_theme(style="whitegrid")
plt.axvline(100,color='brown',linestyle=':',lw=1,alpha=0.5)
plt.text(105,'Thames','UK permitted\neffluent amount',rotation=90,color='brown',fontsize=10)

plt.savefig('suspended_solids.png',dpi=500,bbox_inches='tight')

Based on Suspended Solids and Ammonia, the Anglian region also performs poorly!¶


Analysis 2: Nutrient Measurement¶

Measurements used as indicators of "nutrient" pollution include (but are not limited to):

  • Nitrite
  • Sulphate
  • Chloride
  • Nitrogen
  • Phosphate
In [17]:
nutrients_stats[2016]['determinand.label'].unique()
Out[17]:
array(['Alky pH 4.5', 'Ammonia(N)', 'Chloride Ion', 'N Oxidised',
       'Nitrite-N', 'Orthophospht', 'N Oxid Filt', 'NH3 filt N',
       'Nitrogen - N', 'OrthophsFilt', 'Phosphorus-P', 'SiO2 Filt',
       'SiO2 Rv', 'Sulphate SO4', 'Phosphate', 'P - Filtered',
       'ALK.0.45um', 'CHLORIDE GFC', 'Nitrite Filt'], dtype=object)
In [19]:
f,ax = plt.subplots(1,1,figsize=(12,4))

df = nutrients_stats[2016][nutrients_stats[2016]['determinand.label'] == 'Phosphorus-P']
sns.stripplot(
    data=df, x="result_mean", y="REGION", hue="REGION",palette='rocket',
    alpha=.75, zorder=1, legend=False
)

plt.xlabel('Phosphorus Levels')
plt.title('Phosphorus Levels for 2016',fontweight='bold')
sns.despine(bottom=True, left=True)
sns.set_theme(style="whitegrid")
In [20]:
f,ax = plt.subplots(1,1,figsize=(12,4))

df = nutrients_stats[2016][nutrients_stats[2016]['determinand.label'] == 'Orthophospht']
sns.stripplot(
    data=df, x="result_mean", y="REGION", hue="REGION",palette='rocket',
    alpha=.75, zorder=1, legend=False
)

plt.xlabel('Orthophosphate (mg/L)')
plt.title('Orthophosphate Levels for 2016',fontweight='bold')
sns.despine(bottom=True, left=True)
sns.set_theme(style="whitegrid")
In [21]:
## extract time series of Orthophosphate for each region
regions = nutrients_stats[2000]['REGION'].unique()
ts = {}
for region in regions:
    ts[region] = []
    for year in years:
        ts[region].append(nutrients_stats[year][(nutrients_stats[year]['REGION']==region)&(nutrients_stats[year]['determinand.label']=='Orthophospht')]['result_mean'].mean())


sns.set_theme(style="ticks")
        
f,ax = plt.subplots(10,1,sharex=True,figsize=(6,10))
palette = sns.color_palette("rocket",10)
for i,col,region in zip(range(0,10),palette,regions):
    ax[i].plot(years,ts[region],label=region,marker='o',c=col)
    ax[i].set_title(region,fontweight='normal',fontsize=10)
    ax[i].set_xticks(years)
    sns.despine()
    
for a in ax[:-1]:
    a.spines['bottom'].set_visible(False)
    a.xaxis.set_ticks_position('none')
    
    
for a in ax[::2]:
    a.set_ylabel('Orthophosphate\n(mg/L)',fontsize=9)

plt.subplots_adjust(hspace=0.2)
In [ ]: